This document highlights my findings from an analysis of the Gapminder dataset. The Gapminder dataset contains different metrics/indicators, such as population, GDP per Capita, and CO2 Emissions per Capita, for 263 countries between the years 1962 and 2007. As part of the “R for Data Science” training for the Bioinformatics Research Network, I will analyze the dataset with respect to the following questions:
Before I get started with the analysis, I will import all the necessary libraries as well as the gapminder dataset.
library(tidyverse)
library(ggplot2)
library(plotly)
library(knitr)
library(DT)
gapminder <- read.csv("gapminder_clean.csv", header=TRUE)For the first part of the analysis, I attempt to analyze the relationship between CO2 Emissions (Metric Tons per Capita) and GDP per Capita for the year 1962. To do so, I filtered the gapminder dataset to the year 1962 and removed all records with null values. Below is a scatterplot of CO2 Emissions and GDP per Capita on the filtered dataset.
gapminder_1962 <- gapminder %>% filter(Year == 1962) %>% select(CO2.emissions..metric.tons.per.capita., gdpPercap) %>% na.omit()
gapminder_1962 %>% ggplot(aes(x=CO2.emissions..metric.tons.per.capita., y=gdpPercap)) +
geom_point(colour="black", shape=21, alpha = 1, stroke = 1, size=3, fill="orange") +
xlab("CO2 Emissions (Metric Tons) per Capita") +
ylab("GDP per Capita") +
ggtitle(label="CO2 Emissions vs. GDP per Capita, Year 1962") +
theme(legend.position = "right") +
theme_classic() As evident from the plot above, a lot of data points fall in the bottom left of the plot. As such, the relationship between CO2 Emissions (Metric Tons per Capita) and GDP per Capita is better described using a log scale plot, as shown in the plot below.
gapminder_1962 %>% ggplot(aes(x=CO2.emissions..metric.tons.per.capita., y=gdpPercap)) +
geom_point(colour="black", shape=21, alpha = 1, stroke = 1, size=3, fill="orange") +
xlab("CO2 Emissions (Metric Tons) per Capita") +
ylab("GDP per Capita") +
ggtitle(label="CO2 Emissions vs. GDP per Capita, Year 1962", subtitle = "Logarithmic Scale") +
geom_smooth(method = "lm", se = FALSE, colour = "red") +
theme(legend.position = "right") +
theme_classic() +
scale_x_log10() +
scale_y_log10()gapminder_1962_corr <- gapminder_1962 %>%
mutate(CO2Log = log(CO2.emissions..metric.tons.per.capita.), GDPLog = log(gdpPercap))
correlation <- cor.test(gapminder_1962_corr$CO2Log, gapminder_1962_corr$GDPLog, method="pearson")
cat("The correlation is ", correlation$estimate, "and P-value is ", correlation$p.value)## The correlation is 0.8602081 and P-value is 8.903567e-33
Performing a correlation test on the log transformed variables, the correlation coefficient was identified to be 0.8602081 and a p-value was identified to be 8.9035672^{-33}.
To determine the year in which the correlation between CO2 Emissions per Capita and GDP per Capita is the strongest, I took the raw dataset, removed records with null values, and performed a log transformation on CO2 Emissions per Capita and GDP per Capita. I then grouped the dataset by year and computed the correlation coefficient and p-value across the various time period. Afterwards, I sorted the correlation coefficient in descending order.
gapminder_corr <- gapminder %>%
select(Year, CO2.emissions..metric.tons.per.capita., gdpPercap) %>%
na.omit() %>%
mutate(CO2Log = log(CO2.emissions..metric.tons.per.capita.), GDPLog = log(gdpPercap)) %>%
group_by(Year) %>%
summarize(correl_coeff = cor.test(x = CO2Log, y = GDPLog, method = "pearson")$estimate, p_value = cor.test(x = CO2Log, y = GDPLog, method = "pearson")$p.value) %>%
arrange(desc(correl_coeff))
knitr::kable(gapminder_corr)| Year | correl_coeff | p_value |
|---|---|---|
| 2002 | 0.9300017 | 0 |
| 2007 | 0.9275425 | 0 |
| 1997 | 0.9208183 | 0 |
| 1982 | 0.9151610 | 0 |
| 1987 | 0.9080487 | 0 |
| 1977 | 0.9014278 | 0 |
| 1992 | 0.8934602 | 0 |
| 1972 | 0.8843802 | 0 |
| 1967 | 0.8736773 | 0 |
| 1962 | 0.8602081 | 0 |
As evident above, 2002 was the year with the greatest correlation between CO2 Emissions per Capita and GDP per Capita.
Filtering to the year 2002, I created an interactive plot using plotly, where the point size is determined by the population and the color is determined by the continent.
gapminder_2002 <- gapminder %>%
filter(Year == 2002) %>%
select(CO2.emissions..metric.tons.per.capita., gdpPercap, pop, continent) %>%
na.omit()
p <- gapminder_2002 %>% ggplot(aes(x = CO2.emissions..metric.tons.per.capita., y = gdpPercap, size = pop)) +
geom_point(aes(fill = continent), color = "black", pch= 21) +
xlab("CO2 Emissions (Metric Tons) per Capita") +
ylab("GDP per Capita") +
ggtitle(label = "CO2 Emissions vs. GDP per Capita, Year = 2002", subtitle = "Logarithmic Scale") +
scale_x_log10() +
scale_y_log10() +
theme_classic() +
theme(legend.position = "right")
ggplotly(p, width= 600, height = 600)To assess the relationship between Continent and Energy Use (kg of Oil per Capita) throughout the various time periods in the dataset, I first filtered the raw dataset to eliminate any records with null values and empty/blank continents. I then visualized the Energy Use/Consumption using a boxplot.
energy_continent <- gapminder %>%
select(continent, Energy.use..kg.of.oil.equivalent.per.capita.) %>%
na.omit() %>%
filter(continent != "")
p <- energy_continent %>% ggplot(aes(x = continent, y = Energy.use..kg.of.oil.equivalent.per.capita.)) +
geom_boxplot() +
xlab("Continents") +
ylab("Energy Use (kg of Oil) per Capita") +
ggtitle(label = "Energy Use per Capita by Continent", subtitle = "Logarithmic Scale") +
theme_classic() +
scale_y_log10()
ggplotly(p, width = 600, height = 600)Based on the plot above, there appears to be a difference between continents and energy use. In general, Energy Use per Capita is greatest in Oceania and lowest in Africa. To confirm if there is a statistically significant difference between these variables, a One-Way ANOVA test is performed. A One-Way ANOVA test is used because the data examines the relationship between a categorical independent variable with different levels (continent) and a continuous dependent variable (Energy use (kg of Oil) per Capita).
one_way <- aov(Energy.use..kg.of.oil.equivalent.per.capita. ~ continent, data = energy_continent)
summary(one_way)## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 7.715e+08 192870621 51.46 <2e-16 ***
## Residuals 843 3.160e+09 3748033
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the One Way ANOVA analysis shows a p-value much less than 0.05. As such, there is a statistically significant difference between continents in terms of Energy Consumption.
To analyze the differences between Europe and Asia in terms of Imports of Goods and Services after 1990, I first filtered the raw dataset to extract data points in Europe and Asia and after 1990 and removed any records with null values. I first visualized the relationship between the two variables in all time-periods after 1990 using a boxplot.
continent_import <- gapminder %>%
select(Year, continent, Imports.of.goods.and.services....of.GDP.) %>%
na.omit() %>%
filter(Year > 1990, continent %in% c("Europe", "Asia")) %>%
mutate(Year = as.factor(Year))
p <- continent_import %>%
ggplot(aes(x = continent, y = Imports.of.goods.and.services....of.GDP.)) + geom_boxplot() + xlab("Continent") + ylab("Imports of Goods and Services (% of GDP)") + ggtitle(label = "Asia vs. Europe Imports after 1990") + theme_classic()
ggplotly(p, width = 600, height = 600)Looking at the plot above, there appears to be minimal difference between the average of Imports of Goods and Services across Europe and Asia. To assess if this difference is statistically significant, I perform a t-test. A t-test is used as the data examines the relationship between a 2 categorical variables (Asia and Europe) and a continuous dependent variable (Imports of Goods and Services).
t_test <- t.test(continent_import$Imports.of.goods.and.services....of.GDP. ~ continent, data = continent_import)
t_test##
## Welch Two Sample t-test
##
## data: continent_import$Imports.of.goods.and.services....of.GDP. by continent
## t = 1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
## -2.321099 12.433240
## sample estimates:
## mean in group Asia mean in group Europe
## 46.84531 41.78924
As evident, the p-value is greater than 0.05 - thus, there is no statistically significant difference between Asian and Europe in terms of Imports of Goods and Services after 1990.
To determine the countries with the highest population density across the years, I started by filtering the dataset to remove records with null values. I then sorted the dataset on year and population density in descending order. Afterwards, for every year present in the dataset, I assign a rank to each country based on their row number - countries ranked as number 1 have the highest population density. Below is a plot of the countries with the highest Population Density across the different years.
top_density_year <- gapminder %>% select(Year, Country.Name, Population.density..people.per.sq..km.of.land.area.) %>%
na.omit() %>%
arrange(Year, desc(Population.density..people.per.sq..km.of.land.area.)) %>%
group_by(Year) %>%
mutate(rank = row_number()) %>%
filter(rank <= 10)
p <- top_density_year %>%
transform(Country.Name = reorder(Country.Name, - rank)) %>%
ggplot(aes(Population.density..people.per.sq..km.of.land.area., Country.Name)) + geom_point() + facet_wrap(~Year) + xlab("Population Density per Square KM") + ylab("Country Name")
ggplotly(p, width = 800, height = 800)Taking the average of the ranks across the different years, we identify the following as the top 10 countries with the highest population density across all the years.
top_density <- top_density_year %>%
group_by(Country.Name) %>%
summarize(avg_ranking = mean(rank)) %>%
arrange(avg_ranking) %>%
head(10)
knitr::kable(top_density)| Country.Name | avg_ranking |
|---|---|
| Macao SAR, China | 1.500000 |
| Monaco | 1.500000 |
| Hong Kong SAR, China | 3.100000 |
| Singapore | 3.900000 |
| Gibraltar | 5.000000 |
| Bermuda | 6.200000 |
| Malta | 7.000000 |
| Bahrain | 8.333333 |
| Channel Islands | 8.428571 |
| Bangladesh | 9.000000 |
Based on this analysis, Macao SAR China and Monaco have the highest population density across all the years.
To determine the country with the greatest difference in life expectancy between 1962 and 2007, I filtered the raw dataset to exclude records with null values and extracted the life expectancy for each country at the years 1962 and 2007. I then computed the difference in life expectancy between these two years. The following data table shows the top 20 countries with the greatest difference in life expectancy.
life_expectancy <- gapminder %>%
select(Year, Country.Name, Life.expectancy.at.birth..total..years.) %>%
na.omit() %>%
filter(Year == 1962 | Year == 2007) %>%
spread(Year, Life.expectancy.at.birth..total..years.) %>%
mutate(life_exp_diff = `2007` - `1962`) %>%
arrange(desc(life_exp_diff)) %>%
head(20)
datatable(life_expectancy %>% rename('Country Name' = Country.Name, 'Life Expectancy Difference' = life_exp_diff))We can visualize the difference in life expectancy using a plot shown below.
p <- life_expectancy %>%
transform(Country.Name = reorder(Country.Name, life_exp_diff)) %>%
ggplot(aes(Country.Name, life_exp_diff)) + geom_point() +
xlab("") + ylab("Difference in Life Expectancy between 2007 and 1962") +
labs(title = "Top 20 Countries with Greatest Increase in Life Expectancy") +
coord_flip() + theme_classic()
ggplotly(p, width = 800, height = 600)As evident from the table and plot above, Maldives has the greatest difference in life expectancy between 2007 and 1962 - increasing approximately 37 years since 1962.